Record: MuonEq-R + Depth Recurrence + Mixed Int5/Int6 GPTQ — val_bpb 1.0929 (3-seed mean)#1260

Open
dexhunter wants to merge 1 commit into openai:main from dexhunter:muoneqr-recurrence-mixedquant

Conversation

@dexhunter

Summary

Key Innovations

  1. MuonEq-R — Row-normalizes gradient matrices before Newton-Schulz orthogonalization. Zero-byte cost, ~0.001 BPB improvement.
  2. Depth Recurrence — Layers 4,5 repeated with fully shared MLP weights (zero extra params). ~0.003 BPB improvement.
  3. Mixed Int5/Int6 GPTQ — Hessian sensitivity ranking: 60 int6 + 6 int5 layers for optimal size/quality tradeoff.
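The MuonEq-R idea (innovation 1) can be sketched roughly as below. This is a hypothetical illustration assuming a Muon-style optimizer with the standard quintic Newton-Schulz iteration; the function names and the exact placement of the row normalization are assumptions, not this PR's actual code.

```python
import torch

def newton_schulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Quintic Newton-Schulz iteration approximating the orthogonal factor of G
    (coefficients as commonly used in Muon implementations)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # iterate on the wide orientation for a smaller Gram matrix
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_eq_r_update(grad: torch.Tensor) -> torch.Tensor:
    # "Eq-R" step: row-normalize the gradient so every row contributes
    # equally before orthogonalization.
    row_norms = grad.norm(dim=1, keepdim=True)
    grad = grad / (row_norms + 1e-7)
    return newton_schulz5(grad)
```

Because the Newton-Schulz output is (approximately) orthogonal regardless of input scale, the row normalization changes only which direction gets orthogonalized, so it adds no bytes to the checkpoint.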

Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128)

| Seed | Steps | ms/step | Sliding BPB | val_loss (nats) | Artifact (bytes) |
|------|-------|---------|-------------|-----------------|------------------|
| 1337 | 5,541 | 106.5 | 1.0939 | 2.51667 | 15,933,457 |
| 42 | 5,530 | 106.7 | 1.0922 | 2.51279 | 15,981,324 |
| 0 | 5,543 | 106.5 | 1.0927 | 2.51394 | 15,960,050 |
| Mean | 5,538 | 106.6 | 1.0929 | 2.51447 | 15,958,277 |

Changes from PR #1218

| | PR #1218 | This PR |
|---|---|---|
| val_bpb | 1.09785 | 1.09290 (−0.00495) |
| val_loss | ~2.526 nats | 2.514 nats (−0.011) |
| Optimizer | Muon | MuonEq-R |
| Depth recurrence | None | Layers 4,5 |
| Mixed quantization | No | 60 int6 + 6 int5 |

Credits

Test plan

  • 3-seed verification (1337, 42, 0) — all pass artifact + time + score
  • All seeds under 16,000,000 bytes
  • Train < 600s, eval < 600s
  • No TTT, no SLOT, no forbidden techniques
  • Rule checker passed (log + script)

…1.0929 (3-seed mean)

Adds three techniques to PR openai#1218's 4096-vocab high-WD stack:
- MuonEq-R optimizer (row-norm before NS5 orthogonalization)
- Depth recurrence on layers 4,5 (shared MLP, zero extra params)
- Mixed int5/int6 GPTQ via Hessian sensitivity ranking

3-seed mean: 1.0929 BPB / 2.5145 nats
All seeds under 16MB (max: 15,981,324 bytes)
No TTT, no SLOT, no eval-time adaptation.
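The depth recurrence described above (layers 4 and 5 run with a fully shared MLP, zero extra parameters) could look roughly like this. A minimal sketch assuming a standard pre-norm transformer block; the class and parameter names are illustrative, not the PR's implementation.

```python
import torch
import torch.nn as nn

class RecurrentMLPBlock(nn.Module):
    """Apply the same pre-norm MLP block `repeats` times.

    Extra effective depth comes from reusing the weights, so the
    parameter count is identical to a single block.
    """
    def __init__(self, dim: int, hidden: int, repeats: int = 2):
        super().__init__()
        self.repeats = repeats
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(          # shared across all repeats
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.repeats):      # e.g. "layer 4" and "layer 5"
            x = x + self.mlp(self.norm(x))
        return x
```

The residual connection makes the repeated application behave like two distinct layers at inference time while the checkpoint stores only one set of MLP weights.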
@mikeapedia

Great submission @dexhunter! Did you happen to test muon column norm or row+column norm? I found R+C worked the best with the smaller vocab and I am wondering if that holds here as well.

HateBunnyPlzzz added a commit to Itssshikhar/parameter-golf that referenced this pull request Apr 2, 2026
Approaches revamped (old eval-only approaches removed):
- 01: Low-Rank Factored MLP (18 layers in 16MB via rank-128 MLP factors)
- 02: Reptile Meta-Learning Warmdown (meta-optimize for TTT adaptability)
- 03: SVD + Quantized Factors (13 layers via spectral compression)
- 04: Multi-Token Prediction + BPB-Weighted Loss (training loss innovation)
- 05: Gram-Newton-Schulz + FP8 Training (30% more steps in 10 min)

Unmerged PR research saved to unmerged_runs/:
- PR openai#1263: SLOT (0.9354 BPB, legality contested)
- PR openai#1246: Trinity Ternary (0.9650 BPB)
- PR openai#1241: MDLM Diffusion (0.9901 BPB)
- PR openai#1252: WARP (1.0713 BPB)
- PR openai#1257: Complement Training (1.0855 BPB)
- PR openai#1274: Parallel Residuals + Depth Recurrence (1.0876 BPB)
- PR openai#1260: MuonEq-R + Depth Recurrence (1.0929 BPB)
- PR openai#1254: XSA + LoRA TTT (1.1070 BPB)

Key finding: without eval tricks, frontier is ~1.09 BPB (PR openai#1260)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Omrigotlieb added a commit to Omrigotlieb/parameter-golf that referenced this pull request Apr 3, 2026
Row-normalize the gradient update before Newton-Schulz orthogonalization.
From PR openai#1260: ~0.001 BPB free improvement, zero extra parameters.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 3, 2026
… (3-seed mean)

Improves PR openai#1260 (1.0929) by using N_INT6=61 (one more int6 layer)
with a smaller mini runner (21,396 bytes) that creates enough headroom.

3-seed mean: 1.0924 BPB / 2.5133 nats (seeds 42, 0, 7)
All seeds under 16MB (max: 15,996,591 bytes)
No TTT, no SLOT, no eval-time adaptation.

Techniques: MuonEq-R optimizer, depth recurrence (layers 4,5 shared MLP),
61 int6 + 5 int5 Hessian-ranked GPTQ, brotli-11 compression.

Built on PR openai#1218 by @clarkkev.
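The Hessian-ranked mixed-precision assignment used here (61 int6 + 5 int5 in this follow-up, 60 + 6 in the original PR) can be sketched as below. This is a hypothetical illustration: the sensitivity proxy (Hessian trace times squared weight norm) is an assumed heuristic, not the PR's exact GPTQ criterion, and the function names are invented for illustration.

```python
import numpy as np

def assign_bit_widths(hessians, weights, n_int6=61):
    """Rank layers by a quantization-sensitivity proxy and assign bit widths:
    the least-sensitive layers drop to int5, the rest stay at int6."""
    scores = [
        np.trace(H) * float((W ** 2).sum())   # assumed sensitivity proxy
        for H, W in zip(hessians, weights)
    ]
    order = np.argsort(scores)                # ascending: least sensitive first
    bits = [6] * len(scores)
    for i in order[: len(scores) - n_int6]:   # demote the cheapest layers
        bits[i] = 5
    return bits
```

Demoting only the layers whose quantization error (as estimated from the Hessian) is smallest is what buys the size headroom with minimal BPB cost.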
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 3, 2026
….0912 (3-seed mean)

WD-quantization synergy: higher weight decay (0.090 vs 0.085) compresses
5% better, creating headroom for ALL 66 layers at int6 precision.
The extra quantization quality more than recovers the WD BPB cost.

3-seed mean: 1.0912 BPB / 2.5106 nats (seeds 42, 0, 1337)
All seeds under 16MB with 32K+ margins.
No TTT, no SLOT, no eval-time adaptation.

Built on PR openai#1218 by @clarkkev. Improves PR openai#1260 (1.0929) by 0.0017 BPB.
